Abstract

Background: Missing values are common in microarray data, which usually contain more than 5% missing
values with up to 90% of genes affected. Inaccurate missing value estimation reduces the power of
downstream microarray data analyses. Many types of methods have been developed to estimate missing values.
Among them, the regression-based methods are very popular and have been shown to perform better than the
other types of methods on many test microarray datasets.

Results: To further improve the performance of the regression-based methods, we propose shrinkage
regression-based methods. Our methods take advantage of the correlation structure in the microarray data and
select similar genes for the target gene by Pearson correlation coefficients. In addition, our methods incorporate
the least squares principle, use a shrinkage estimation approach to adjust the coefficients of the regression
model, and then use the new coefficients to estimate missing values. Simulation results show that the proposed
methods provide more accurate missing value estimation on six test microarray datasets than the existing
regression-based methods do.

Conclusions: Imputation of missing values is a very important aspect of microarray data analyses because most of 
the downstream analyses require a complete dataset. Therefore, exploring accurate and efficient methods for 
estimating missing values has become an essential issue. Since our proposed shrinkage regression-based methods 
can provide accurate missing value estimation, they are competitive alternatives to the existing regression-based 
methods. 



Background 

Microarray technology has become an important and
useful tool in functional genomics research. This
high-throughput technique allows the characterization of
the gene expression of the whole genome by measuring the
relative transcript levels of thousands of genes under
various experimental conditions or time points [1]. Microarray
data analyses have been widely used to investigate various
biological processes such as the cell cycle process [2-8]
and the stress response [9,10].

Although microarray technology has been developed
for more than a decade, typical microarray data still
contain more than 5% missing values with up to 90% of
genes affected [11]. Missing values can arise for
various reasons, including technological failures, administrative
error, insufficient resolution, image corruption, and
dust or scratches on the slide [12]. As many downstream
analysis methods (such as gene clustering, disease classification
and gene network reconstruction) require complete
datasets, missing value estimation becomes an
important pre-processing step in microarray data
analysis [11-13].

The missing values in a microarray dataset are traditionally
handled by repeating the microarray experiments
or by simply replacing the missing values with zero or the row
average (the average expression over the experimental
conditions). Because these approaches are either time-consuming
or lead to serious estimation errors, more
advanced missing value imputation methods are needed.
In 2001, Troyanskaya et al. published the first two missing value
imputation algorithms, based on the k-nearest neighbors (kNNimpute) and
the singular value decomposition (SVDimpute) [12]. Since
then, many missing value imputation methods have been
proposed, such as Bayesian principal component analysis
(BPCA) [14], Gaussian mixture clustering imputation
(GMCimpute) [11], conditional ordered list imputation
[15], and random-forest-based imputation [16].

Among the existing missing value imputation methods, 
the regression-based methods are very popular and con- 
tain many algorithms, including least squares imputation 
(LSimpute) [17], local least squares imputation (LLSim- 
pute) [18], sequential local least squares imputation 
(SLLSimpute) [19], and iterated local least squares impu- 
tation (ILLSimpute) [13]. LSimpute estimates the missing 
values in the target gene by using a weighted average of 
the k estimates from the k most similar genes. Each esti- 
mate is attained by constructing a single regression 
model of the target gene by a similar gene. LLSimpute 
represents the target gene as a linear combination of k 
similar genes by a multiple regression model and uses 
the regression coefficients to estimate the missing values. 
SLLSimpute modifies the LLSimpute by estimating the 
missing values sequentially from the gene containing the 
fewest missing values and partially utilizing these esti- 
mated values. ILLSimpute modifies the LLSimpute by 
not choosing the similar genes with a fixed number k but 
defining the similar genes as the genes whose distances 
from the target gene are less than a distance threshold 
and then runs LLSimpute iteratively. 

In this study, we focus on the regression-based meth- 
ods because these methods have been shown to have bet- 
ter performances than the other existing methods in 
many test microarray datasets [20,21]. To further
improve the performance of the regression-based methods,
we propose shrinkage regression-based methods,
which replace the least squares estimator with a shrinkage
estimator when estimating the regression
coefficients of the regression model. Shrinkage estimators
such as the James-Stein estimator have been shown
to dominate the least squares estimator in many statistical
models [22,23]. By adopting our new regression coefficients
in the regression-based methods, we show that
missing value estimation can be improved on six test
microarray datasets.

Methods 

In this study, we propose using the well-known shrinkage
estimation approach to improve three existing regression-based
methods (LLSimpute [18], SLLSimpute [19], and
ILLSimpute [13]) for missing value estimation. We call
our proposed methods the shrinkage regression-based
methods (see Figure 1). In the following subsections, we
first introduce the shrinkage estimation approach and
then describe the proposed shrinkage LLSimpute, shrinkage
SLLSimpute, and shrinkage ILLSimpute.

Shrinkage estimation approach 

Figure 1 The shrinkage regression-based methods. [Flowchart: the original regression-based methods (LLSimpute, SLLSimpute and ILLSimpute) lead, through the four steps below, to the shrinkage regression-based methods (shrinkage LLSimpute, shrinkage SLLSimpute, and shrinkage ILLSimpute).]
1. Construct a regression model
2. Use the least squares principle to estimate the coefficients of the regression model
3. Utilize a shrinkage estimation approach to adjust the coefficients of the regression model
4. Adopt the new coefficients to estimate the missing values

One of the shrinkage estimators, the James-Stein estimator,
for the normal distribution is introduced here.
Suppose that $Y_1, Y_2, \ldots, Y_k$ are independent normal random
variables that all have a common
known variance $\sigma^2$, but whose means are unknown and different.
Let $Y_i \sim N(\theta_i, \sigma^2)$ and $Y = (Y_1, \ldots, Y_k)$. Then we have
$Y \sim N(\theta, \sigma^2 I)$, where $\theta = (\theta_1, \ldots, \theta_k)$ and $I$ is a $k \times k$ identity
matrix. Let $\delta(Y) = (\delta_1(Y), \ldots, \delta_k(Y))$ be an estimator of $\theta$.
Under the squared error loss function

$$L(\theta, \delta(Y)) = \sum_{i=1}^{k} (\theta_i - \delta_i(Y))^2 = \|\theta - \delta(Y)\|^2, \tag{1}$$

we are interested in finding estimators of $\theta$ such that the
mean squared error $E_Y[L(\theta, \delta(Y))]$ is minimized. An intuitive
estimator of $\theta$ is $Y$ (i.e. $\hat{\theta}_i = Y_i$, $i = 1, \ldots, k$). However,
Stein [22] showed that when $k \geq 3$, there exist other
estimators with smaller mean squared error than the intuitive
estimator $Y$. For $k \geq 3$, under the squared error loss,
the intuitive estimator $Y$ is dominated by the estimator

$$\delta^{JS}(Y) = \left(1 - \frac{(k-2)\sigma^2}{S^2}\right) Y, \tag{2}$$

Wang et al. BMC Systems Biology 2013, 7(Suppl 6):S11
http://www.biomedcentral.com/1752-0509/7/S6/S11






where $S^2 = \sum_{i=1}^{k} Y_i^2$. The estimator in (2) is called
the James-Stein estimator in the literature [23]. With
the form in (2), the James-Stein estimator of $\theta_i$ is

$$\delta_i^{JS}(Y) = \left(1 - \frac{(k-2)\sigma^2}{S^2}\right) Y_i. \tag{3}$$

It is worth noting that the estimator of $\theta_i$ in (3) depends
not only on the random variable $Y_i$ but also on the other variables
$Y_1, \ldots, Y_{i-1}, Y_{i+1}, \ldots, Y_k$ because of the term $S^2$. In
contrast, the intuitive estimator $\hat{\theta}_i = Y_i$ does not use the
other variables $Y_1, \ldots, Y_{i-1}, Y_{i+1}, \ldots, Y_k$ but only uses $Y_i$ to
estimate $\theta_i$. It has been shown that estimators using other
variables' information provide more accurate estimation for
$\theta$ than the intuitive estimator does [22]. In fact, besides
the estimator in (3), the estimators of the form

$$\delta_i^{c}(Y) = \left(1 - \frac{c\,\sigma^2}{S^2}\right) Y_i \tag{4}$$

all have uniformly smaller mean squared error than
the intuitive estimator $Y_i$ for $k \geq 3$ and $0 < c < 2(k - 2)$.
Among all the estimators of the form in (4), the estimator
in (3) has the minimum mean squared error. The
shrinkage estimation approach has also been shown to
have good performance in interval estimation [24,25].
Based on the James-Stein estimator in (3), we developed 
shrinkage regression-based imputation methods. 
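
As a hedged illustration of equation (3), the following Python sketch computes the James-Stein estimate for a vector of observations with a known common variance, and compares its simulated squared error loss with that of the intuitive estimator $Y$. The function names `james_stein` and `simulated_risk`, the use of NumPy, and the simulation settings are assumptions made for this example; they are not part of the original study.

```python
import numpy as np


def james_stein(y, sigma2):
    """James-Stein estimate of the mean vector, following equation (3).

    y      : observations Y_1, ..., Y_k (one per unknown mean)
    sigma2 : the known common variance
    """
    y = np.asarray(y, dtype=float)
    k = y.size
    if k < 3:
        raise ValueError("the James-Stein estimator requires k >= 3")
    s2 = np.sum(y ** 2)  # S^2 = sum of Y_i^2
    if s2 == 0.0:
        return y.copy()  # nothing to shrink when every observation is zero
    return (1.0 - (k - 2) * sigma2 / s2) * y


def simulated_risk(theta, sigma2, trials, seed=0):
    """Monte Carlo estimate of the mean squared error of Y and of James-Stein."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta, dtype=float)
    ys = rng.normal(theta, np.sqrt(sigma2), size=(trials, theta.size))
    naive = np.mean(np.sum((ys - theta) ** 2, axis=1))
    js = np.mean([np.sum((james_stein(y, sigma2) - theta) ** 2) for y in ys])
    return naive, js
```

For instance, `james_stein([3.0, 4.0, 0.0, 0.0, 0.0], 1.0)` multiplies every component by $1 - 3/25 = 0.88$, because $k - 2 = 3$ and $S^2 = 25$.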

Notations 

In a typical microarray data matrix, the rows are the genes
under investigation and the columns are the experimental
conditions or time points. The microarray data matrix is
obtained by performing a series of experiments on the
same set of genes. We use $G \in \mathbb{R}^{m \times n}$ to represent a
microarray data matrix with $m$ genes and $n$ experiments,
and assume $m \gg n$, which is true for microarray data. In
the matrix $G$, a row $g_i^T \in \mathbb{R}^{1 \times n}$ represents the expressions
of the $i$th gene in $n$ experiments:

$$G = \begin{pmatrix} g_1^T \\ \vdots \\ g_m^T \end{pmatrix}, \tag{5}$$

where $g_i^T$ denotes the transpose of the column vector $g_i$.
If there is a missing value in the $l$th position of the $i$th
gene, we denote it as $\alpha$, i.e. $G_{il} = g_{il} = \alpha$.

Shrinkage local least squares imputation (Shrinkage 
LLSimpute) 

In the LLSimpute method [18], a target gene with missing
values is represented as a linear combination of $k$
similar genes. Rather than using all genes in the dataset,
only $k$ genes with high similarity to the target gene are
used. The procedure of selecting $k$ similar genes is as
follows. Suppose that the target gene is the first gene
and has a missing value $\alpha$ in the first position, i.e. $\alpha = g_{11}$
in the matrix $G \in \mathbb{R}^{m \times n}$. The Pearson correlation
coefficient is used to find the $k$ similar genes. These $k$
similar genes are called the $k$-nearest neighbor genes,
which have the $k$ largest absolute values of the Pearson
correlation coefficients. The Pearson correlation coefficient
$r_{1j}$ between the target gene and the $j$th gene is
defined as

$$r_{1j} = \frac{1}{n-2} \sum_{t=2}^{n} \left(\frac{g_{1t} - \bar{g}_1}{\sigma_1}\right) \left(\frac{g_{jt} - \bar{g}_j}{\sigma_j}\right), \tag{6}$$

where $\bar{g}_j$ and $\sigma_j$ denote the average and the sample
standard deviation of the vector $(g_{j2}, \ldots, g_{jn})$. When computing
the correlation coefficients, $g_{j1}$ is not used because
it corresponds to the position of the missing value in the
target gene. Based on these selected $k$-nearest neighbor
genes, a matrix $A \in \mathbb{R}^{k \times (n-1)}$ and two vectors $b \in \mathbb{R}^{k \times 1}$
and $w \in \mathbb{R}^{(n-1) \times 1}$ can be formed as follows



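$$\begin{pmatrix} g_1^T \\ g_{s_1}^T \\ \vdots \\ g_{s_k}^T \end{pmatrix} = \begin{pmatrix} \alpha & w^T \\ b & A \end{pmatrix} = \begin{pmatrix} \alpha & w_1 & w_2 & \cdots & w_{n-1} \\ b_1 & A_{1,1} & A_{1,2} & \cdots & A_{1,n-1} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ b_k & A_{k,1} & A_{k,2} & \cdots & A_{k,n-1} \end{pmatrix},$$

where $\alpha$ is the missing value in the target gene and
$g_{s_1}, \ldots, g_{s_k}$ are the $k$-nearest neighbor genes of the target
gene $g_1$. Each row of the matrix $A$ consists of the last $n - 1$
elements of one $k$-nearest neighbor gene $g_{s_i}$, $1 \leq i \leq k$.
The elements of the vector $b$ are the first elements
of all these $k$-nearest neighbor genes, and the elements
of the vector $w$ are the last $n - 1$ elements of the
target gene $g_1$. With the matrix $A$ and the vectors $b$
and $w$, the least squares problem is formulated in
LLSimpute as

$$\min_{x} \left\| A^T x - w \right\|_2. \tag{7}$$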



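Solving the above problem, the least squares regression
coefficients $\hat{x} \in \mathbb{R}^{k \times 1}$ are obtained as

$$\hat{x} = (\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_k)^T = (A A^T)^{-1} A w. \tag{8}$$

In LLSimpute, the missing value is then
estimated by

$$\hat{\alpha} = b^T \hat{x} = \hat{x}_1 b_1 + \hat{x}_2 b_2 + \cdots + \hat{x}_k b_k. \tain{9}$$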
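
The neighbor selection and the least squares steps in (6) to (9) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `select_neighbors` and `lls_impute_one` are hypothetical, every gene other than the target is assumed to be complete, and `numpy.linalg.lstsq` is used in place of the explicit inverse in (8) because $A A^T$ has rank at most $n - 1$ and is therefore singular whenever $k > n - 1$.

```python
import numpy as np


def select_neighbors(G, target, miss_col, k):
    """Indices of the k genes with the largest |Pearson r| to the target gene.

    The column holding the missing value is skipped, as in equation (6).
    """
    cols = [c for c in range(G.shape[1]) if c != miss_col]
    X = G[:, cols]
    Xc = X - X.mean(axis=1, keepdims=True)        # center every gene
    norms = np.linalg.norm(Xc, axis=1)
    num = Xc @ Xc[target]
    denom = norms * norms[target]
    # Constant genes have zero norm; give them correlation 0.
    r = np.divide(num, denom, out=np.zeros_like(num), where=denom > 0)
    score = np.abs(r)
    score[target] = -np.inf                       # never select the target itself
    return np.argsort(-score)[:k]


def lls_impute_one(G, target, miss_col, k):
    """Estimate G[target, miss_col] from the k most correlated genes."""
    nbrs = select_neighbors(G, target, miss_col, k)
    cols = [c for c in range(G.shape[1]) if c != miss_col]
    A = G[np.ix_(nbrs, cols)]                     # k x (n-1)
    b = G[nbrs, miss_col]                         # first elements of the neighbors
    w = G[target, cols]                           # observed part of the target gene
    # Solve min_x ||A^T x - w||_2, equation (7).
    x_hat, *_ = np.linalg.lstsq(A.T, w, rcond=None)
    return float(b @ x_hat), x_hat                # equation (9)
```

When $k \leq n - 1$ and $A$ has full row rank, the least squares solution is unique and agrees with (8); otherwise `lstsq` returns the minimum-norm solution, which is a design choice of this sketch.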



In this study, we want to improve the performance of
LLSimpute by adjusting the regression coefficients in (8).
Our shrinkage LLSimpute combines the LLSimpute
method with the shrinkage estimator to impute the missing
values. Our method replaces the regression coefficient
estimators $\hat{x}$ in (8) by the shrinkage estimator, and then
uses the new estimator to estimate the missing value $\alpha$ in
(9). However, we found that applying the existing shrinkage
estimator in (3) did not always improve the performance
of LLSimpute. Therefore, we tested different forms
of the shrinkage coefficient estimators and devised a
feasible coefficient estimator to improve the LLSimpute
method. We propose using the shrinkage regression coefficients

$$\hat{x}^{JS} = \left(1 - \frac{(k-2)\sigma^2}{\tilde{n} S^2}\right) \hat{x} \tag{10}$$

to replace the conventional coefficients in (8), where
$\sigma^2$ is the variance of the coefficients $(\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_k)$, $S$
is the norm of the coefficients (i.e. $S^2 = \sum_{i=1}^{k} \hat{x}_i^2$), $k$ is the
row number of the matrix $A$ and $\tilde{n}$ is the column number
of the matrix $A$, which equals $n - 1$ in this case.
Finally, the missing value is estimated as

$$\hat{\alpha} = b^T \hat{x}^{JS} = \hat{x}_1^{JS} b_1 + \hat{x}_2^{JS} b_2 + \cdots + \hat{x}_k^{JS} b_k, \tag{11}$$

where $\hat{x}^{JS} = (\hat{x}_1^{JS}, \ldots, \hat{x}_k^{JS})^T$.
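
A short Python sketch of the coefficient adjustment in (10) and the estimate in (11) is given below. The function names `shrink_coefficients` and `shrinkage_estimate` are hypothetical, and the use of the sample variance (`ddof=1`) is an assumption of this example, since the text only states that the variance of the coefficients is used.

```python
import numpy as np


def shrink_coefficients(x_hat, n_cols):
    """Shrink least squares coefficients following equation (10).

    x_hat  : least squares coefficients from (8), length k
    n_cols : column number of the matrix A (n - 1 for a single missing value)
    """
    x_hat = np.asarray(x_hat, dtype=float)
    k = x_hat.size
    sigma2 = np.var(x_hat, ddof=1)   # assumption: sample variance of the coefficients
    s2 = np.sum(x_hat ** 2)          # S^2, the squared norm of the coefficients
    if s2 == 0.0:
        return x_hat.copy()          # all-zero coefficients cannot be shrunk further
    factor = 1.0 - (k - 2) * sigma2 / (n_cols * s2)
    return factor * x_hat


def shrinkage_estimate(b, x_hat, n_cols):
    """Missing value estimate b^T x_JS, following equation (11)."""
    return float(np.dot(b, shrink_coefficients(x_hat, n_cols)))
```

With `x_hat = [1, 2, 3, 4]` and `n_cols = 7`, the sample variance is $5/3$ and $S^2 = 30$, so every coefficient is multiplied by $1 - (2 \cdot 5/3)/(7 \cdot 30) = 62/63$.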



Shrinkage sequential local least squares imputation 
(Shrinkage SLLSimpute) 

LLSimpute does not use the information of genes
with missing values, since the existence of missing values
hinders the use of the other observed values of those genes.
The SLLSimpute method estimates the missing values
sequentially, starting from the gene containing the fewest missing
values, and partially utilizes these estimated values. The
details of SLLSimpute [19] are as follows. First, the
microarray matrix $G \in \mathbb{R}^{m \times n}$ is divided into two submatrices:
a complete matrix $G_1 \in \mathbb{R}^{m_1 \times n}$ consisting of genes
without missing values and an incomplete matrix
$G_2 \in \mathbb{R}^{(m - m_1) \times n}$ consisting of genes with missing values.
In the incomplete matrix $G_2$, the genes are sorted by their
missing rates. The first gene has the smallest missing rate
and the last gene has the largest missing rate. The missing
rate is calculated by

$$r_i = \frac{c_i}{n}, \tag{12}$$

where $c_i$ is the number of missing values in the $i$th gene.
The imputation is executed sequentially from the first
gene of $G_2$. That is, the first gene of $G_2$, which has the
smallest missing rate, is selected as the first target gene.
Then LLSimpute is applied to estimate the missing values
in the target gene by finding the $k$-nearest neighbor
genes from the complete matrix $G_1$ and then using the
formula in (9) to estimate the missing values. After all
the missing values in the target gene have been filled, the gene is moved to
$G_1$. Then the second gene of $G_2$ is selected as the target
gene and the same process is repeated. By moving the
genes whose missing values have been imputed to the
complete matrix, the previous target genes with imputed
values can be utilized for the missing value estimation of
the following target genes. However, too many missing
values in a gene will result in a large estimation error, and
reusing a gene with too many imputed values will reduce
the imputation performance. Therefore, only the genes
with missing rates less than a threshold $r_0$ are reused,
where $r_0$ is set as the average missing rate of all genes
containing missing values, i.e.,

$$r_0 = \frac{\sum_{i=1}^{m - m_1} c_i}{(m - m_1) \times n}. \tag{13}$$

By a similar argument as for the shrinkage LLSimpute,
we apply the shrinkage estimator to SLLSimpute. The
shrinkage SLLSimpute adjusts the coefficients of the
regression model by the formula in (10) and uses the
formula in (11) to estimate the missing values.
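
The bookkeeping part of SLLSimpute (the missing rates in (12), the processing order, and the reuse threshold in (13)) can be sketched as follows. This is only an illustration: the function name `sll_order_and_threshold` is hypothetical, and missing entries are assumed to be encoded as NaN.

```python
import numpy as np


def sll_order_and_threshold(G):
    """Processing order and reuse threshold for sequential imputation.

    Returns
    -------
    order    : indices of the genes with missing values, fewest missing first
    r0       : average missing rate of those genes, equation (13)
    reusable : genes whose missing rate is below r0 and may be moved to G_1
    """
    G = np.asarray(G, dtype=float)
    n = G.shape[1]
    c = np.isnan(G).sum(axis=1)            # c_i: number of missing values per gene
    incomplete = np.flatnonzero(c > 0)     # rows of G_2
    rates = c[incomplete] / n              # r_i = c_i / n, equation (12)
    order = incomplete[np.argsort(rates, kind="stable")]
    r0 = float(rates.mean()) if incomplete.size else 0.0
    reusable = {int(i) for i in order if c[i] / n < r0}
    return order, r0, reusable
```

Each gene in `order` would then be imputed in turn with LLSimpute (or its shrinkage variant), and only the genes listed in `reusable` would be added to the pool of complete genes afterwards.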

Shrinkage iterated local least squares imputation 
(Shrinkage ILLSimpute) 

The LLSimpute and SLLSimpute methods select $k$-nearest
neighbor genes for a target gene, where $k$ is a fixed number.
The ILLSimpute method [13], however, does
not fix the number of similar genes selected. Instead,
it defines the similar genes as the genes whose distances
to the target gene are less than a distance
threshold $\delta$. The rationale for using a distance threshold
rather than a fixed number of similar genes is that
some of the $k$-nearest neighbor genes may already be far
away from the target gene and thus not very similar to the
target gene.

The procedure of ILLSimpute is as follows. In the first
iteration, missing values of each target gene are filled
with the row average. Then a distance threshold $\delta$ is used
to select the similar genes of each target gene. Finally,
the LLSimpute method is used to estimate the missing values
of each target gene. In later iterations, the ILLSimpute
method uses the imputed results from the previous iteration
to reselect the similar genes of each target gene
(using the same distance threshold) and applies the LLSimpute
method to re-estimate the missing values.

By a similar argument as for the shrinkage LLSimpute,
we apply the shrinkage estimator to ILLSimpute. The
shrinkage ILLSimpute adjusts the coefficients of the
regression model by the formula in (10) and uses the
formula in (11) to estimate the missing values.
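
The iterative procedure can be sketched in Python as follows. Several details are assumptions of this example rather than statements from the text: the function name `ills_impute` is hypothetical, missing entries are encoded as NaN, the distance is taken to be the Euclidean distance over the observed positions of the target gene, and the number of iterations is a free parameter. The shrinkage step in (10) is omitted to keep the sketch short.

```python
import numpy as np


def ills_impute(G, delta, n_iter=3):
    """Iterated local least squares imputation with a distance threshold."""
    G = np.asarray(G, dtype=float)
    miss = np.isnan(G)
    # First iteration starts from row-average imputation.
    filled = np.where(miss, np.nanmean(G, axis=1, keepdims=True), G)
    targets = np.flatnonzero(miss.any(axis=1))
    for _ in range(n_iter):
        updated = filled.copy()
        for i in targets:
            obs = np.flatnonzero(~miss[i])   # observed positions of the target
            mis = np.flatnonzero(miss[i])    # missing positions of the target
            # Assumed metric: Euclidean distance on the observed positions.
            dist = np.linalg.norm(filled[:, obs] - filled[i, obs], axis=1)
            dist[i] = np.inf                 # exclude the target itself
            nbrs = np.flatnonzero(dist < delta)
            if nbrs.size == 0:
                continue                     # keep the previous estimate
            A = filled[np.ix_(nbrs, obs)]
            B = filled[np.ix_(nbrs, mis)]
            w = filled[i, obs]
            x_hat, *_ = np.linalg.lstsq(A.T, w, rcond=None)
            updated[i, mis] = x_hat @ B      # re-estimate the missing values
        filled = updated
    return filled
```

Because neighbors are reselected from `filled` in every pass, genes that were imputed in an earlier iteration can serve as similar genes later on, which mirrors the reselection step described above.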

Results and Discussion 

We conducted several experiments to compare the performances
of our shrinkage regression-based methods
and the original regression-based methods under different
scenarios. In the first subsection, we introduce the
benchmark datasets. In the second subsection, we
describe how we measure the performance of various
imputation methods. In the following three subsections,
we report the comparison results for different numbers
of similar genes used, different missing rates, and different
noise levels. Finally, we further compare the performances
of our shrinkage regression-based methods and
three existing non-regression-based methods.

Table 1 Benchmark datasets.

Name         Dimension of        Dimension of reduced   Time series   Ref.
             original datasets   complete datasets      data
Ogawa        6263 x 8            3069 x 8               N             [26]
BohenSH      2364 x 24           623 x 24               N             [27]
Lymphoma     4096 x 96           854 x 96               N             [28]
Brauer05     6133 x 20           706 x 20               Y             [29]
Shapira04A   4771 x 23           2970 x 23              Y             [30]
Shapira04B   4771 x 14           3340 x 14              Y             [30]

Datasets 

Considering the effects of dataset selection and types of
microarray experiments on the performance of an imputation
method, six representative datasets (three non-time
series and three time series) were used in our
simulations. They were Ogawa's data from the study of
phosphate accumulation and polyphosphate
metabolism (denoted as Ogawa, non-time series) [26],
Bohen's follicular lymphomas data (denoted as
BohenSH, non-time series) [27], the data from a lymphoma
study (denoted as Lymphoma, non-time series)
[28], the data from Brauer's experiments, which studied
the physiological response to glucose limitation in batch
and steady-state cultures of yeasts (denoted as Brauer05,
time series) [29], and Shapira's oxidative stress data
(denoted as Shapira04A and Shapira04B, time series)
[30]. We divided Shapira's data into two datasets
because the authors used one kind of oxidative chemical
in the experiment in Shapira04A, but another
kind of oxidative chemical in the experiment in Shapira04B.
The six microarray datasets were used as
benchmark datasets in numerical experiments to compare
the performances of our shrinkage regression-based
methods and the original regression-based methods.
Each dataset was processed by deleting the genes with
missing values to generate a complete data matrix, and
the details of these datasets are listed in Table 1.



The performance measure 

A common criterion used to compare the performances
of different imputation methods is the normalized root
mean squared error (NRMSE) [11-13,17-19]. From a
microarray dataset, we can obtain an original data matrix
$M_0$ with $m$ genes and $n$ experiments, and then we can
construct a complete matrix $M_1 \in \mathbb{R}^{m_1 \times n}$ ($m_1 \leq m$) by
deleting the genes with missing values. After the complete
data matrix $M_1$ is established, we randomly select a
specific percentage of the elements of $M_1$ and regard
these elements as missing values. Then we estimate the
missing values using various imputation methods and
compare their performances using the NRMSE, which is
defined as

$$\mathrm{NRMSE} = \frac{\sqrt{\mathrm{mean}\left[(y_{guess} - y_{ans})^2\right]}}{\mathrm{std}\left[y_{ans}\right]}, \tag{14}$$

where $y_{guess}$ and $y_{ans}$ are vectors whose elements are
the estimated values by an imputation method and the
known answers for all missing entries, respectively.
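
The evaluation protocol described above (hide a fixed percentage of entries of the complete matrix, impute them, and score the result with the NRMSE) can be sketched as follows. The function names `mask_entries` and `nrmse` are hypothetical, hidden entries are encoded as NaN, and the population standard deviation (NumPy's default) is an assumption, since the text does not state which variant is used.

```python
import numpy as np


def mask_entries(M, rate, seed=0):
    """Hide a fraction `rate` of the entries of a complete matrix.

    Returns the matrix with hidden entries set to NaN and the boolean mask
    marking which entries were hidden.
    """
    rng = np.random.default_rng(seed)
    M = np.asarray(M, dtype=float)
    n_missing = int(round(rate * M.size))
    idx = rng.choice(M.size, size=n_missing, replace=False)
    mask = np.zeros(M.size, dtype=bool)
    mask[idx] = True
    mask = mask.reshape(M.shape)
    masked = M.copy()
    masked[mask] = np.nan
    return masked, mask


def nrmse(y_guess, y_ans):
    """Normalized root mean squared error over the hidden entries."""
    y_guess = np.asarray(y_guess, dtype=float)
    y_ans = np.asarray(y_ans, dtype=float)
    return float(np.sqrt(np.mean((y_guess - y_ans) ** 2)) / np.std(y_ans))
```

Given an imputed matrix `imputed`, the score would be `nrmse(imputed[mask], M[mask])`. A useful reference point is that guessing the mean of the true values for every hidden entry gives an NRMSE of exactly 1.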

Performance comparison for different k values 

A parameter k, the number of similar genes used, has to
be determined before using two of the regression-based methods
(LLSimpute and SLLSimpute). Since the performance
of both algorithms is known to be affected by
the k value used, and different microarray datasets may
have different optimal k values [18,19], we tested several
possible k values (50, 100, 150, 200, 250 and 300) on the six
benchmark datasets. Table 2 lists the optimal k values
for LLSimpute and SLLSimpute on each of the six
benchmark datasets. The third regression-based method
(ILLSimpute) does not have the parameter k and therefore
was not considered in this numerical experiment.

For each of the six benchmark datasets, we also compared
the performances of the proposed shrinkage
regression-based methods and the original regression-based
methods for several possible k values (50, 100, 150,
200, 250 and 300). In our numerical experiments, the missing
rate for each benchmark dataset was set to 5%.
Namely, for each dataset, we randomly removed 5% of the
entries of the complete matrix to generate a matrix with
missing values, and then estimated the missing values
using the shrinkage and the original regression-based
methods. The same procedure was run for five independent
rounds and the average NRMSE of these five
simulations was used to compare the performances of
different imputation methods.

Table 2 The optimal k value for each benchmark dataset.

Algorithms\Datasets   Ogawa   BohenSH   Lymphoma   Brauer05   Shapira04A   Shapira04B
LLS                   100     250       300        300        250          200
SLLS                  150     300       250        300        250          200

Figure 2 Performance comparison between shrinkage LLS (shr_LLS) and LLS for different k values. [Panels are grouped into non-time series and time series datasets; the x-axis is the number of similar genes used (k).]

As shown in Figure 2, the proposed shrinkage LLSimpute
outperforms LLSimpute for all k values and all
benchmark datasets. Similarly, the proposed shrinkage
SLLSimpute outperforms SLLSimpute for all k values
and all benchmark datasets (see Figure 3). The simulation
results suggest that utilizing a shrinkage estimation
approach to adjust the coefficients of the regression
model can improve the performances of the original
regression-based methods.

Performance comparison for different missing rates 

In real applications, different microarray data may have different
missing rates to be imputed. It is informative to
know how an imputation method performs for different
missing rates. Therefore, we compared the performances of
the shrinkage regression-based methods and the original
regression-based methods on the microarray data with different
missing rates (1%, 5%, 10%, 15% and 20%). Namely,
for each of the six benchmark datasets, we randomly
removed x% (x = 1, 5, 10, 15 or 20) of the entries of the complete
matrix to generate a matrix with missing values, and then
estimated the missing values using the shrinkage and the
original regression-based methods. The same procedure
was run for five independent rounds and the average
NRMSE of these five simulations was used to compare the
performances of different imputation methods. Note that
the optimal k value used for each benchmark dataset is
listed in Table 2.

Figure 3 Performance comparison between shrinkage SLLS (shr_SLLS) and SLLS for different k values. [Panels: Ogawa, BohenSH and Lymphoma (non-time series); Brauer05, Shapira04A and Shapira04B (time series); the x-axis is the number of similar genes used (k).]

Figure 4 Performance comparison between shrinkage LLS (shr_LLS) and LLS for different missing rates. [Panels are grouped into non-time series and time series datasets; the x-axis is the missing rate (%).]

Figure 4 shows that the proposed shrinkage LLSimpute
outperforms LLSimpute for all missing rates and all
benchmark datasets. Figure 5 shows that the proposed
shrinkage SLLSimpute outperforms SLLSimpute for all
missing rates and all benchmark datasets. Figure 6 shows
that the proposed shrinkage ILLSimpute outperforms
ILLSimpute for all missing rates and all benchmark datasets.
The simulation results suggest that utilizing a
shrinkage estimation approach to adjust the coefficients
of the regression model can improve the performances of
the original regression-based methods.

Figure 5 Performance comparison between shrinkage SLLS (shr_SLLS) and SLLS for different missing rates. [Panels are grouped into non-time series and time series datasets; the x-axis is the missing rate (%).]

Performance comparison for different noise levels 

In real applications, different microarray data may contain
different levels of noise. It is informative to know how an
imputation method performs for different levels of noise
inherent in the microarray data. Therefore, we compared
the performances of the shrinkage regression-based methods
and the original regression-based methods on the
microarray data with different noise levels. For each of the
six benchmark datasets, we added Gaussian noise of different
levels to the data. The magnitudes of the noise
were set in terms of standard deviations ranging from 0
to 0.25 with a step size of 0.05. In our numerical experiments,
the missing rate for each benchmark dataset was set to 5%
and the optimal k value used for each benchmark dataset
is listed in Table 2. Namely, for each dataset (after adding
Gaussian noise to the data), we randomly removed 5% of the
entries of the complete matrix to generate a matrix with
missing values, and then estimated the missing values using
the shrinkage and the original regression-based methods.
The same procedure was run for five independent rounds
and the average NRMSE of these five simulations was used
to compare the performances of different imputation
methods.
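
The noise step of this experiment can be sketched in Python as follows. The function name `add_gaussian_noise` and the constant `NOISE_LEVELS` are hypothetical, and zero-mean noise applied independently to every entry is an assumption of this example, since the text only specifies the standard deviations.

```python
import numpy as np

# Standard deviations from 0 to 0.25 with a step size of 0.05.
NOISE_LEVELS = [0.0, 0.05, 0.10, 0.15, 0.20, 0.25]


def add_gaussian_noise(M, sd, seed=0):
    """Return a copy of M with zero-mean Gaussian noise of standard deviation sd."""
    rng = np.random.default_rng(seed)
    M = np.asarray(M, dtype=float)
    return M + rng.normal(0.0, sd, size=M.shape)
```

In a full run, each noisy matrix would then have 5% of its entries hidden and imputed, and the NRMSE would be averaged over five independent rounds, as described above.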



Figure 7 shows that the proposed shrinkage LLSimpute 
outperforms LLSimpute for all noise levels and all bench- 
mark datasets. Figure 8 shows that the proposed shrinkage 
SLLSimpute outperforms SLLSimpute for all noise levels 
and all benchmark datasets. Figure 9 shows that the pro- 
posed shrinkage ILLSimpute outperforms ILLSimpute for 
all noise levels and all benchmark datasets. The simulation 
results suggest that utilizing a shrinkage estimation 
approach to adjust the coefficients of the regression model 
can improve the performances of the original regression- 
based methods. 

Performance comparison with three existing non- 
regression-based methods 

We have shown that our shrinkage regression-based 
methods perform better than the existing regression-based 
methods. Still, it would be interesting to know whether 
our shrinkage regression-based methods provide more 
accurate missing value imputation than the existing non- 
regression-based methods do. Therefore, we compared the 
performances of our shrinkage regression-based methods 
and three existing non-regression-based methods 
(kNNimpute [12], SVDimpute [12], and BPCA [14]) on 
the six benchmark microarray datasets. As shown in
Figures 10, 11 and 12, the proposed shrinkage regression-based
methods outperform these three existing non-regression-based
methods for almost all missing rates and
all benchmark datasets. Taken together, our shrinkage
regression-based methods are competitive alternatives to
the existing methods for microarray missing value
imputation.

Figure 7 Performance comparison between shrinkage LLS (shr_LLS) and LLS for different noise levels.



Conclusions 

Imputation of missing values is a very important aspect
of microarray data analyses because most downstream
analyses require a complete dataset. Therefore, exploring
accurate and efficient methods for estimating missing
values has become an essential issue. In this study,
regression-based methods combined with a shrinkage
estimation approach are proposed to estimate missing
values in microarray data. Our methods take
advantage of the correlation structure existing in the
microarray data and select similar genes for the target
gene by Pearson correlation coefficients. In addition, our
methods incorporate the least squares principle, use a
shrinkage estimation approach to adjust the coefficients
of the regression model, and apply the new coefficients of
the regression model to estimate missing values.
Simulation results show that the proposed shrinkage
regression-based methods provide more accurate missing
value estimation for various types of datasets than the
original regression-based methods do. Since our proposed
methods can be applied to modify any kind of
regression-based method and can provide accurate missing
value estimation, they are competitive alternatives to
the existing regression-based methods.

Figure 9 Performance comparison between shrinkage ILLS (shr_ILLS) and ILLS for different noise levels.

Figure 10 Performance comparison between shrinkage LLS (shr_LLS) and three non-regression-based methods for different missing rates.

Figure 11 Performance comparison between shrinkage SLLS (shr_SLLS) and three non-regression-based methods for different missing rates.

Figure 12 Performance comparison between shrinkage ILLS (shr_ILLS) and three non-regression-based methods for different missing rates.